8/26/2018

What is ResPlat?

  • Cloud & High Performance Computing
  • Data Storage & Management
  • Training & Community
  • Part of University Services: Here for all academic departments

Related Courses

Improving the speed of your R code

“R is not a fast language. This is not an accident. R was purposely designed to make data analysis and statistics easier for you to do. It was not designed to make life easier for your computer.”

Hadley Wickham (Advanced R)

Today

  • Part 1: Profiling (why is my code slow?)
    • Memory usage too!
  • Part 2: Optimisation (fix my slow code)
  • Part 3: Parallelisation (run multiple bits of code at once)
  • Part 4: Cloud & High Performance Computing (f%^$ it, gimme a bigger computer please)

Goal: Less waiting, more research.

After the Hangover

Part 1: Profiling

  • Where is our code slow?
    • Remember, we want to work smarter, not harder: it’s only worth optimising the code that could make a significant difference.
    • Let’s find the bottleneck(s)!

Time your code with microbenchmark

library(microbenchmark)

x <- runif(100)
microbenchmark(
  sqrt(x),
  x ^ 0.5
)
## Unit: nanoseconds
##     expr  min     lq    mean median     uq    max neval
##  sqrt(x)  489  623.5 6074.52  832.5 1183.5 464075   100
##    x^0.5 2760 4277.5 5608.14 4584.0 4957.0  26877   100

Instead of using microbenchmark(), you could use the built-in function system.time(). But system.time() is much less precise.

What is overhead?
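The slides leave this as a discussion prompt. In short: every R-level function call carries a fixed cost (argument matching, environment creation) that dominates when the work per call is tiny. A minimal sketch of measuring it — the wrapper name below is invented for illustration:

```r
library(microbenchmark)

x <- runif(100)

# A do-nothing wrapper adds one extra R-level function call;
# the timing difference is (roughly) the per-call overhead.
wrapped_sqrt <- function(v) sqrt(v)

microbenchmark(
  sqrt(x),         # primitive, minimal overhead
  wrapped_sqrt(x)  # same work plus one extra call
)
```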

Before Optimising, Organise your Code

When tackling a bottleneck, you’re likely to come up with multiple approaches. Write a function for each approach, encapsulating all relevant behaviour. This makes it easier to check that each approach returns the correct result and to time how long it takes to run.
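One way to follow this advice, sketched with invented function names and the square-root example from earlier: wrap each candidate in its own function, check they agree, then benchmark them side by side.

```r
library(microbenchmark)

# One function per candidate approach (names are illustrative).
sqrt_builtin <- function(x) sqrt(x)
sqrt_power   <- function(x) x ^ 0.5
sqrt_explog  <- function(x) exp(log(x) / 2)

x <- runif(100)

# Correctness first...
stopifnot(all.equal(sqrt_builtin(x), sqrt_power(x)),
          all.equal(sqrt_builtin(x), sqrt_explog(x)))

# ...then speed.
microbenchmark(sqrt_builtin(x), sqrt_power(x), sqrt_explog(x))
```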

Optimisation Examples:

  • read.csv(): specify known column types with colClasses.

  • factor(): specify known levels with levels.

  • cut(): don’t generate labels with labels = FALSE if you don’t need them, or, even better, use findInterval() as mentioned in the “see also” section of the documentation.

  • unlist(x, use.names = FALSE) is much faster than unlist(x).

  • interaction(): if you only need combinations that exist in the data, use drop = TRUE.

  • If you’re converting continuous values to categorical, make sure you know how to use cut() and findInterval().
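A sketch of two of the points above (unlist() without names, and cut() versus findInterval()); the variable names are invented:

```r
library(microbenchmark)

x <- as.list(setNames(runif(1e4), paste0("v", seq_len(1e4))))

# Dropping names avoids building a 10,000-element character vector.
microbenchmark(
  unlist(x),
  unlist(x, use.names = FALSE)
)

# findInterval() returns bin indices directly, skipping the
# factor machinery that cut() goes through.
vals   <- runif(1e4)
breaks <- seq(0, 1, by = 0.1)
microbenchmark(
  cut(vals, breaks, labels = FALSE),
  findInterval(vals, breaks)
)
```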

Vectorisation:

Using vectorisation for performance means finding the existing R function that is implemented in C and most closely applies to your problem. The loops in a vectorised function are written in C instead of R. Loops in C are much faster because they have much less overhead.

rowSums(), colSums(), rowMeans(), and colMeans() are vectorised matrix functions that will always be faster than the equivalent apply() call. You can sometimes use them as building blocks for other vectorised functions.
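For instance (a sketch; the loop-based function name is invented), the vectorised rowSums() does the same work as an apply() over rows, with the loop in C rather than R:

```r
library(microbenchmark)

m <- matrix(runif(1e5), nrow = 1000)

# apply() runs an R-level call to sum() for every row.
row_sums_apply <- function(m) apply(m, 1, sum)

# Same answer, but the loop happens in C.
stopifnot(all.equal(rowSums(m), row_sums_apply(m)))

microbenchmark(rowSums(m), row_sums_apply(m))
```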


Avoid Copies:

A pernicious source of slow R code is growing an object with a loop. Whenever you use c(), append(), cbind(), rbind(), or paste() to create a bigger object, R must first allocate space for the new object and then copy the old object to its new home.
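A minimal sketch of the difference (function names invented): growing a vector with c() inside a loop versus allocating the full vector up front.

```r
library(microbenchmark)

grow <- function(n) {
  out <- c()
  for (i in 1:n) out <- c(out, i ^ 2)  # copies `out` on every iteration
  out
}

prealloc <- function(n) {
  out <- numeric(n)                    # allocate once, fill in place
  for (i in 1:n) out[i] <- i ^ 2
  out
}

stopifnot(identical(grow(1e3), prealloc(1e3)))
microbenchmark(grow(1e3), prealloc(1e3), times = 10)
```

The gap widens as n grows, because the copying cost of grow() is quadratic in n while prealloc() stays linear.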

Byte Code Compile your function

lapply2 <- function(x, f, ...) {
  out <- vector("list", length(x))
  for (i in seq_along(x)) {
    out[[i]] <- f(x[[i]], ...)
  }
  out
}

lapply2_c <- compiler::cmpfun(lapply2)

All base R functions are byte code compiled by default.
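A usage sketch: confirm the compiled version agrees with the original, then compare both against base lapply(). Note that on recent R versions (3.4+) the JIT compiles your functions automatically after a few calls, so the measured gap may be small.

```r
# lapply2() as defined above.
lapply2 <- function(x, f, ...) {
  out <- vector("list", length(x))
  for (i in seq_along(x)) {
    out[[i]] <- f(x[[i]], ...)
  }
  out
}
lapply2_c <- compiler::cmpfun(lapply2)

x <- as.list(runif(100))
stopifnot(identical(lapply2(x, sqrt), lapply2_c(x, sqrt)))

microbenchmark::microbenchmark(
  lapply2(x, sqrt),
  lapply2_c(x, sqrt),
  lapply(x, sqrt)
)
```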

Parallelise

The parallel package ships with base R, but the idiom differs by platform (both versions appear in Advanced R): on macOS and Linux, mclapply() forks the current R process; on Windows, forking is unavailable, so you create a cluster of worker processes with makeCluster() and use parLapply() instead.
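A sketch of both patterns (slow_square is an invented toy function; real speedups depend on your machine and on how much work each task does relative to the communication cost):

```r
library(parallel)

slow_square <- function(x) {
  Sys.sleep(0.1)  # stand-in for real work
  x ^ 2
}

# macOS / Linux: mclapply() forks the current process, so workers see
# your workspace automatically. On Windows mc.cores must be 1, which
# falls back to ordinary lapply().
res_fork <- mclapply(1:4, slow_square, mc.cores = 2)

# Portable (including Windows): start fresh R worker processes and
# export everything they need explicitly.
cl <- makeCluster(2)
clusterExport(cl, "slow_square")
res_sock <- parLapply(cl, 1:4, slow_square)
stopCluster(cl)

stopifnot(identical(res_fork, res_sock))
```

The key practical difference: forked workers inherit loaded packages and objects for free, while PSOCK workers start empty, so forgetting a clusterExport() or clusterEvalQ(cl, library(...)) is the most common Windows-only failure.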